Dust is a common cause of health risks and a contributor to climate change,
one of the most threatening problems facing humanity. Over the past decade,
climate change in Iraq, typified by increased drought and desertification, has
generated numerous environmental issues. This study forecasts dust in five
central Iraqi districts using a supervised machine learning framework built on
five regression algorithms. The framework was assessed using a dataset from
the Iraqi meteorological organization and seismology (IMOS). Simulation
results show that the gradient boosting regressor (GBR) has a mean square
error of 8.345 and a total accuracy ratio of 91.65%. The decision tree (DT)
comes in second place, with a mean square error of 8.965 and an accuracy ratio
of 91%. Bayesian ridge (BR), linear regressor (LR), and stochastic gradient
descent (SGD) follow with accuracy ratios of 84.365%, 84.363%, and 79%,
respectively. These results indicate the predictive precision of the
regression models. The interaction framework was designed to be a
straightforward tool for working with this paradigm. This model is a valuable
tool for establishing strategies to counter the swiftness of climate change in
the area under study.
1. INTRODUCTION

Dust forecasting predicts the amount, location, and movement of airborne dust particles over a specific
region or area. The primary goal of dust forecasting is to warn the public and relevant authorities of any
potential hazards caused by high dust levels in the atmosphere. Dust information is crucial to various
industries, green technology, and smart grids, and has environmental, agricultural, and economic effects. When
data are abundant, empirical methodologies are employed as dust forecasting methods for predicting local-scale
dust variables [1].

Many repercussions of climate change have become evident in Iraq; drought is one of them,
particularly over the past decade. Iraq’s drought has worsened due to a combination of circumstances, including
illegal migration caused by a series of wars, inefficient management of water resources, and a large variety of
other factors [2]. The dust-monitoring stations whose data were used for training are located in five Iraqi
governorates in the central and southern part of the country. The stations are accredited to World
Meteorological Organization (WMO) standards and are listed in Table 1.

Baghdad is the capital and the most populous province, with the highest population density; it
accounts for 21.3% of Iraq’s population, with more than nine million people, and its total land area
measures 5,169 km². Kut, Karbala, Najaf, and Hilla have a combined population of seven million; these provinces
are adjacent to Baghdad and are significant for economic, commercial, and religious tourism. Although the central
regions of Iraq have been badly affected by frequent sand and dust storms, little is known
about the nature of these storms in terms of wind speed, direction, range of visibility, and impact [3].

The terms “dust storm” and “sandstorm” are typically used interchangeably because the distinction
between them is minimal [4], [5]. Some experts differentiate between sandstorms and dust storms based on the
size of the soil particles: the phenomenon is referred to as a sandstorm if the particles are between
0.6 and 1 millimeter and as a dust storm if the particles are smaller than 0.6 millimeters. The most prevalent type
of storm in deserts is the dust storm, in which the wind carries clay and silt particles up to 0.5 mm in diameter.
Dust is one of the main features associated with arid and semi-arid climatic conditions, which are characterized by
climatic fluctuations that cause dust and sand to rise and be carried over long distances, forming the so-called
dust phenomenon. Dust storm particles differ in size and shape, so it is not easy to obtain two samples with the
same properties; these properties depend on the sources from which the particles form, on their physical and
chemical characteristics, and on the speed of the wind carrying them [6]. Table 2 presents the
characteristics of different types of dust.


Table 1. The study area

Province          Station ID   Longitude   Latitude
Baghdad-airport   650          44 24       33 18
Kut               664          45 49       32 30
Karbala           656          44 03       32 34
Najaf             670          44 19       31 57
Hilla             657          44 27       32 27


Table 2. Types of dust [7]

Event type            Horizontal visibility (km)   Wind speed (m/s)   Particle diameter (µm)
Suspended dust (SD)   0 to <10                     0-7                <1
Rising dust (RD)      1 to <10                     8                  1-10
Dust storm (DS)       <1                           8                  <100
Sand storm (SS)       <1                           8                  250


Since the prediction of dust storms is an urgent issue in the modern world, machine learning has
created numerous opportunities for research in this area [8]. To forecast dust with high accuracy and
to help overcome dust-related issues in the study area, an intelligent procedure can be implemented that
comprises regression algorithms such as Bayesian ridge regression (BRR), gradient boosting regressor (GBR),
stochastic gradient descent (SGD), linear regressor (LR), and decision tree regression (DTR). These algorithms
simulate nonlinear input data and generate synthetic mechanisms to study dust [9], [10].

Machine learning regression algorithms use a mathematical model to evaluate the relationship
between a set of input features and a continuous output variable. The model is trained using labelled data in
which the input features and output values are known. Once the model has been trained, it can forecast the
output for new, unseen data [11]. All algorithms have a specific structure containing several components that
create a model to predict a continuous numerical output variable based on input features [12], including data
preparation, feature selection, model creation, evaluation, deployment, and maintenance [13]. The specific
structure of a regression algorithm can vary depending on the problem and the algorithm used. Several studies
on dust forecasting using machine learning have been conducted. Particular attention is paid to forecasting dust
types, but first, it is necessary to define the term forecast. A forecast is a dynamic filtering approach that predicts
future values based on the past values of one or more time series [14], [15]. Different regression algorithms
have been applied to dust prediction. Khusfi et al. [16] used multiple linear regression
(MLR), Bayesian regularized neural networks (BRNN), support vector machines (SVM), and random forests
(RF), and found that they improve the understanding of predictability effectively and efficiently. Khusfi et al. [16]
also predicted the number of dusty days using stochastic gradient boosting (SGB), conditional inference random
forest (CRF), and SVM models based on three feature selection (FS) algorithms. Tan et al. [17] forecasted
dustfall in Iraq using a Bayesian network (BN) to help anticipate the maximum and minimum dustfall
that will occur in the following months. In this study, we evaluate how to apply intelligent framework
algorithms to dust forecasting using gathered datasets and a machine learning-based dust forecast system built
with the Python sklearn and pandas libraries. We employ the sklearn sequencing model as our machine learning
algorithm to learn and predict dust data, and we use data from the weather forecasting department of the Iraqi
meteorological organization and seismology (IMOS).
2. METHOD

The hybrid learning approach is illustrated in Figure 1 and has two stages. In the first stage, the
historical dataset is loaded. It was taken from IMOS; the raw data, which contain dust phenomena, are split into
70% for training and 30% for testing, and the collected data are preprocessed to improve their quality. In the
second stage, the preprocessed dataset is input to each of the five regression models: Bayesian ridge (BR),
decision tree (DT), gradient boosting (GB), linear regressor (LR), and stochastic gradient descent (SGD).


[Figure 1 flowchart: historical dataset from IMOS, split into training dataset (70%) and testing dataset (30%); preprocessing of each part (1. missing value, 2. normalization); select regression model (SGD, LR, GB, DT, BR); one prediction per model.]



Figure 1. Proposed intelligent framework 
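
The two-stage flow in Figure 1 can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function name `run_framework`, the synthetic features, the default hyperparameters, and the choice to fit the scaler on the training part only are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import BayesianRidge, LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor


def run_framework(X, y, train_fraction=0.7, seed=0):
    """Split in time order, min-max scale, fit five regressors, return test MSE per model."""
    split = int(len(X) * train_fraction)  # first 70% of rows train, last 30% test
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # Assumption: scaling statistics come from the training part only.
    scaler = MinMaxScaler().fit(X_train)
    X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

    models = {
        "BR": BayesianRidge(),
        "DT": DecisionTreeRegressor(random_state=seed),
        "GB": GradientBoostingRegressor(random_state=seed),
        "LR": LinearRegression(),
        "SGD": SGDRegressor(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train_s, y_train)
        scores[name] = mean_squared_error(y_test, model.predict(X_test_s))
    return scores


# Synthetic stand-in for the IMOS features (hypothetical values, not real data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 360, size=(200, 3))
y_demo = 0.01 * X_demo[:, 0] + rng.normal(0, 0.1, size=200)
scores = run_framework(X_demo, y_demo)
for name, value in scores.items():
    print(f"{name}: test MSE = {value:.4f}")
```

Keeping the rows in time order before splitting mirrors the 70%/30% train/test division described above; a shuffled split would be a different experimental design.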


2.1. Dataset 

Hourly historical data were prepared by IMOS from its stations, which are spread across various regions of
Iraq and accredited by the WMO with regard to the monitoring devices used and the selection of station
locations. Five stations distributed over five governorates in central Iraq were tested. The processed data contain
dust-related variables such as wind speed (WS), wind direction (WD), range of visibility (ROV), present weather
(WW), and past weather (W1W2), together with the time and date, as shown in Table 3.


Table 3. IMOS dataset values used in our experiment

station   year   month   day   hour   ROV   WD    WS   WW   W1W2
650       2018   01      01    13     57    240   01   06   11
650       2018   01      01    14     57    160   02   06   00
650       2018   01      01    15     58    130   02   06   11
670       2022   10      31    09     58    150   05   06   22
670       2022   10      31    10     58    160   05   06   22
670       2022   10      31    11     58    160   04   06   11


The data cover the period from January 1, 2018, through October 31, 2022, during which IMOS monitored
continuously. Normalization improved the quality of the gathered data, as in (1). Equipment faults and
other uncontrollable variables may cause gaps in the data from dust monitoring equipment. Data records
with multiple missing values were eliminated, and linear interpolation was used to fill gaps when only one
missing value was found. Data processing yielded 17,023 valid sets.


$$X = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
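
As a small illustration of the two preprocessing steps, the sketch below applies min-max scaling as in (1) and fills isolated gaps by linear interpolation. It assumes a gap means a single missing reading in the time series of one variable with valid neighbours on both sides; the helper names `min_max_normalize` and `fill_single_gaps` are hypothetical.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using X = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid division by zero and map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fill_single_gaps(values):
    """Linearly interpolate isolated None entries; longer gaps are left as None."""
    filled = list(values)
    for i in range(1, len(values) - 1):
        if values[i] is None and values[i - 1] is not None and values[i + 1] is not None:
            filled[i] = (values[i - 1] + values[i + 1]) / 2
    return filled


# Hypothetical readings: one isolated gap and one two-step gap.
readings = [1.0, None, 3.0, None, None, 6.0]
print(fill_single_gaps(readings))            # the isolated gap becomes 2.0
print(min_max_normalize([57, 58, 57.5]))     # [0.0, 1.0, 0.5]
```

Records that still contain `None` after this step would then be dropped, in line with the elimination of records with multiple missing values.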


2.2. Supervised machine learning regression 

This section provides an overview of the five regression algorithms, namely BR, DT, GB, LR, and SGD,
that are employed for training on the dust features in the dataset. In each case, the algorithm finds patterns and
relationships in the data to map input features to output values.
2.2.1. Bayesian ridge regression

The BRR is a type of linear regression that uses a Bayesian approach to regularization. Compared to
the ordinary least squares (OLS) estimator, the BRR coefficients are slightly shifted towards zero, which
stabilises them. The BRR model defines regression in probabilistic terms and allows for the inclusion of
regularization parameters, which play a crucial role in the estimation procedure. The regularization parameter is
not set deterministically but is instead assigned a prior distribution, usually a Gaussian distribution, and the
model is trained using Bayesian inference [18]. The BRR can be used when there is insufficient or poorly
distributed data to formulate the linear regression in (2).


$$y = X\beta + \varepsilon \tag{2}$$


Here $y$ is the $n \times 1$ vector of the dependent variable, $X$ is the $n \times p$ matrix of the independent
variables, $\beta$ is the $p \times 1$ vector of regression coefficients, and $\varepsilon$ is the vector of error terms.
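
The shrinkage towards zero can be seen in a one-feature sketch without an intercept. It relies on a standard result: under a zero-mean Gaussian prior with fixed precision, the posterior mean of the coefficient coincides with the ridge estimate. The penalty `lam` is fixed by hand here purely for illustration, whereas a full BRR infers it from the data; the data values are made up.

```python
def ols_slope(xs, ys):
    """Least-squares slope for y = beta * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def ridge_slope(xs, ys, lam):
    """Posterior-mean slope under a zero-mean Gaussian prior; lam = noise variance / prior variance."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
beta_ols = ols_slope(xs, ys)
beta_ridge = ridge_slope(xs, ys, lam=2.0)
print(beta_ols, beta_ridge)  # the ridge estimate is closer to zero
```

A larger `lam` corresponds to a tighter prior around zero and therefore to stronger shrinkage.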


2.2.2. Decision tree regression 

The DTR can accurately predict continuous variables like dust levels from input information. It has a 
tree-like structure with internal nodes representing features, branches representing decisions based on those 
features, and leaf nodes representing anticipated values or outcomes. The DTR recursively partitions data based 
on input feature values to estimate dust levels. Selecting the most informative features and splitting data at each 
node reduces prediction error. The DTR approach uses variance reduction and mean squared error to choose 
the optimal feature to split the data. It iteratively finds the feature and dividing point that most reduce prediction 
error. During training, the DTR algorithm generates the tree by recursively partitioning the data into
feature-based groups until a stopping requirement is met, such as a maximum depth, a minimum number of samples
in a leaf node, or a minimum decrease in prediction error. To predict with the trained DTR model, a new data
point follows the splitting criterion at each node. The average or weighted average of target variable values in the
leaf node reached by a new data point determines its anticipated value [19]. The DTR can handle numerical
and categorical features, can capture nonlinear feature-target connections, and is interpretable. If it is not
regularized or the tree is too complex, it may overfit. The DTR can use wind direction, wind speed, and air quality information
to predict dust levels. 
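
The split-selection step can be illustrated on a single feature. The sketch below scans candidate thresholds and keeps the one with the lowest weighted mean squared error in the two child nodes, which is equivalent to maximising variance reduction. The function names and the toy wind-speed and dust values are hypothetical.

```python
def node_mse(values):
    """Mean squared error of a node that predicts the mean of its targets."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted child MSE) for the best split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, node_mse(ys))  # start from the unsplit node
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(pairs)
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Hypothetical example: low wind speeds give low dust, high wind speeds give high dust.
wind_speed = [1, 2, 3, 10, 11, 12]
dust_level = [5, 5, 5, 20, 20, 20]
threshold, score = best_split(wind_speed, dust_level)
print(threshold, score)
```

A full tree repeats this search over every feature and applies it recursively to each child node until a stopping rule is reached.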


2.2.3. Gradient boosting regression 

The GB regression combines numerous weak regression models to predict continuous variables like dust
levels. It works by iteratively building an ensemble of models that correct each other’s errors. The GBR minimizes
a loss function by iteratively adding models to the ensemble. The algorithm first fits a decision tree-based weak
regression model to the training data. The residuals, that is, the differences between actual dust levels and current
ensemble predictions, are then calculated [20]. New weak regression models are trained to forecast the residuals
in subsequent iterations. Gradient descent determines the best direction in which to update the ensemble. New
models are added to the ensemble until a stopping requirement is fulfilled. The GBR can capture complex
correlations and handle huge feature spaces, making it useful for predicting dust levels, and it is used extensively
in environmental science and air quality monitoring. The ensemble update is given in (3).


$$F_m(x) = F_{m-1}(x) + h_m(x) \tag{3}$$


where $F_m(x)$ is the prediction of the ensemble model at iteration $m$ for input $x$, $F_{m-1}(x)$ is the prediction
of the ensemble model at iteration $m-1$ for input $x$, and $h_m(x)$ is the weak learner model at iteration $m$ for
input $x$.
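
The additive update in (3) can be sketched from scratch with depth-1 trees (stumps) on one feature. For squared-error loss, the negative gradient is simply the residual, so each round fits a stump to the current residuals. The `learning_rate` factor is an assumption added for illustration (setting it to 1 reproduces (3) exactly), and the data values are made up.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on each side
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Build F_m = F_{m-1} + learning_rate * h_m, starting from the mean of ys."""
    base = sum(ys) / len(ys)
    learners = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(h(x) for h in learners)

    return predict


# Hypothetical one-feature data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]
model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(baseline_mse, boosted_mse)
```

The training error of the boosted ensemble falls below that of the constant baseline because every round removes part of the remaining residual.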


2.2.4. Linear regression 

Linear regression is a statistical algorithm that predicts continuous variables, such as dust levels,
by leveraging a linear correlation between the input features and the target variable. In dust forecasting, LR
assumes that the dust levels can be expressed as a linear combination of the input features. The algorithm
estimates the coefficients of the linear equation that best fits the data, minimizing the difference between the
predicted and actual dust levels. These coefficients represent the contribution of each input feature to the dust
levels. By multiplying the feature values by their corresponding coefficients and summing the products, the
algorithm generates predictions for dust levels [21]. LR is a widely used and interpretable method for dust
forecasting, as it allows the quantitative impact of each input feature on the predicted dust
levels to be understood. It is particularly effective when the features and the target variable exhibit a linear
relationship, as in (4).
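
The coefficient estimation and the multiply-and-sum prediction step can be sketched with an ordinary least-squares solve. This is an illustrative example on made-up data with two hypothetical features; `fit_linear` and `predict_linear` are names chosen here, not part of the study.

```python
import numpy as np


def fit_linear(X, y):
    """Return [intercept, coef_1, ..., coef_p] minimising the squared prediction error."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the intercept
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Multiply each feature by its coefficient, sum, and add the intercept."""
    return coef[0] + X @ coef[1:]


# Hypothetical data generated from y = 1 + 2*x1 + 0.5*x2 with no noise.
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]
coef = fit_linear(X_demo, y_demo)
print(coef)
```

Because the toy targets are exactly linear in the features, the fit recovers the generating coefficients; with real measurements the coefficients would only approximate the underlying relationship.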
2.2.5. Stochastic gradient descent regression

Robbins and Monro introduced SGD in 1951. It is an efficient method, also known as incremental
gradient descent. In addition, it is a stochastic approximation approach that accumulates an average of past
gradients whose weights diminish exponentially. It is a well-established method with numerous benefits; for
example, it offers favorable running time as well as tools for controlling model complexity [22]. The SGD
regression model is given in (5).


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \tag{5}$$
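
A minimal sketch of how SGD fits the coefficients of a model like (5), shown for a single feature. Each update uses one sample at a time and moves the coefficients against the gradient of that sample's squared error. The learning rate, number of epochs, and data are illustrative assumptions; this plain version omits the gradient averaging mentioned above.

```python
import random


def sgd_linear(xs, ys, lr=0.1, epochs=500, seed=0):
    """Fit y = b0 + b1 * x by stochastic gradient descent on 0.5 * squared error."""
    rng = random.Random(seed)
    b0, b1 = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            error = (b0 + b1 * xs[i]) - ys[i]
            b0 -= lr * error          # gradient with respect to b0
            b1 -= lr * error * xs[i]  # gradient with respect to b1
    return b0, b1


# Hypothetical noiseless data from y = 1 + 2x, with x already scaled to [0, 1].
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0 + 2.0 * x for x in xs]
b0, b1 = sgd_linear(xs, ys)
print(b0, b1)
```

Scaling the inputs to [0, 1] beforehand, as in (1), keeps the per-sample updates stable for a fixed learning rate.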


3. RESULTS AND DISCUSSION 

This section presents the study and its outcomes in three subsections. It covers the experimental setup
and model implementation tools, the testing and evaluation methods, and finally the simulation results and
their evaluation.


3.1. Experimental setup 

Experiments were performed on a computer running a 64-bit version of Windows (20H2) with an Intel
8-core processor clocked at 2.80 GHz and 8192 MB of RAM. The predictive models for all algorithms were
developed using Python version 3.6.5 with the integrated development and learning environment (IDLE).
Keras in conjunction with TensorFlow was utilized for LR, GBR, BRR, SGD, and DTR.


3.2. Evaluation metrics 

Four assessment measures are utilized to determine the correctness and accuracy of the prediction
models: mean absolute error (MAE), mean square error (MSE), relative mean absolute error (RMAE),
and root-mean-square percentage error (RMSPE). MAE computes the average absolute difference between the
original value and the forecasted value [23]. As a result, we can assess the similarity between the forecasts and
the actual data. MAE is mathematically expressed as in (6).


$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|O_i - P_i\right| \tag{6}$$


where $O_i$ represents the projected values, $P_i$ represents the actual values, and $N$ is the sample size.

In contrast, the MSE measures the mean of the squared errors, that is, the average squared
difference between actual and anticipated values. It is utilized to assess the precision of regression models
and can be expressed as in (7).


$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(O_i - P_i\right)^{2} \tag{7}$$


The RMAE is also used. It is commonly applied to evaluate the performance of regression models, especially when
the data distribution has a high proportion of outliers, as it is less sensitive to outliers than RMSE. A lower
RMAE indicates better model performance, with a value of zero indicating that the model has perfect prediction
accuracy, as in (8).


$$\mathrm{RMAE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|O_i - P_i\right|}{A_i} \tag{8}$$


On the other hand, RMSPE is used to evaluate the performance of predictive models in cases where 
the relative difference between the predicted and actual values is more important than the absolute difference 
as in (9) [24]. For example, when the values in the dataset span several orders of magnitude, RMSPE can 
provide a more meaningful evaluation of the model's performance than RMSE or RMAE. 


$$\mathrm{RMSPE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{O_i - P_i}{A_i}\right)^{2}} \times 100\% \tag{9}$$
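
The absolute metrics in (6) and (7) and a relative metric of the RMSPE type can be computed as in the sketch below. The RMSPE here follows the common textbook form, dividing each error by the actual value; this is an assumption and may differ in detail from the form used in the study. The function names and sample values are hypothetical.

```python
import math


def mae(actual, forecast):
    """Mean absolute error: average of |actual - forecast|."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mse(actual, forecast):
    """Mean square error: average of (actual - forecast)^2."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def rmspe(actual, forecast):
    """Root-mean-square percentage error (common form); actual values must be nonzero."""
    ratios = [((a - f) / a) ** 2 for a, f in zip(actual, forecast)]
    return 100.0 * math.sqrt(sum(ratios) / len(ratios))


# Hypothetical actual and forecast values.
actual = [2.0, 4.0, 5.0]
forecast = [1.0, 4.0, 7.0]
print(mae(actual, forecast), mse(actual, forecast), rmspe(actual, forecast))
```

MAE and MSE are symmetric in their two arguments, whereas the relative metric is not, so the order of actual and forecast values matters only for the latter.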


3.3. Simulation results and evaluation

In this part, the performance of every model on the dataset used in this research is evaluated. The
dataset contains three types of dust (suspended dust, rising dust, and dust storm); dust was predicted
in general, without distinguishing between types, to ensure the accuracy of the system’s predictions [25]. The
predictive algorithm must account for all of these pattern deviations. The MAE and MSE are used to measure the
accuracy of the model’s predictions due to the low influence of major outliers. The RMSE, the standard deviation
(SD), and the RMSPE are also used as metrics of prediction accuracy. In the test, the window size is set to
11,915.4 hours, which implies that the first 11,915.4 hours of data, 70% of the total, are used to predict the data
for the next 213 days.

The training data range from 2018 to 13 May 2021, and the testing range is set to
14 May 2021 through December 2022 (30% of the data). The first scenario starts with training a BRR on the data
described above. The remaining four regression algorithms, DT, GB, SGD, and LR, are then applied. The MAE of
the predictor started to increase gradually, which indicates a significant decrease in the
prediction accuracy of the model. Table 4 lists the evaluation metrics for the algorithms
applied, and Figures 2 to 5 illustrate them. The comparison between the BRR, GBR,
SGD, LR, and DTR models reveals that the GBR model has the lowest MSE and performs
better than the other regression models for most data series patterns. A lower RMAE suggests a more effective
model. The results demonstrate that the suggested ML-GBR model beats the other algorithms for most
datasets used in this study, confirming this model’s forecasting capability. Furthermore, the test accuracy
of the five machine learning regression models, namely GBR, DT, BRR, LR, and SGD, can be
represented by their respective accuracy ratios of 91.65%, 91%, 84.365%, 84.363%, and 79%. These accuracy
ratios are determined using the mean square error metric. The GBR model exhibits the minimum mean square error
(MSE) of 8.345, while the DT regression yields an MSE of 8.965, the BRR regression an MSE of 15.635,
the LR regression an MSE of 15.637, and the SGD regression an MSE of 20.966, as illustrated in Figure 6.


Table 4. Various models’ evaluation metrics for dust forecasting

Algorithm                               MAE            MSE             RMAE           RMSPE
Bayesian ridge regressor                1.822714285    15.6355531422   1.3500793626   3.9541817285
Decision tree regressor                 0.2874739435   8.9651317036    0.5361659664   2.9941829776
Gradient boosting regressor             0.4283970541   8.3453972637    0.6545204765   2.8888401242
Linear regressor                        1.8259554442   15.6377671095   1.3512791881   3.9544616712
Stochastic gradient descent regressor   1.3773921547   20.96618639     1.1736235149   4.5788848419
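
As a quick arithmetic cross-check of Table 4, the sketch below compares the tabulated columns with each other and with the accuracy ratios quoted in the text. Numerically, the RMAE and RMSPE entries coincide with the square roots of the MAE and MSE entries, and the accuracy ratios are close to 100 minus the MSE. This is an observation about the printed numbers, not a definition stated in the text; the decimal point in the SGD RMAE value is assumed.

```python
import math

# (MAE, MSE, RMAE, RMSPE, accuracy ratio in %) copied from Table 4 and the text.
table4 = {
    "BRR": (1.822714285, 15.6355531422, 1.3500793626, 3.9541817285, 84.365),
    "DTR": (0.2874739435, 8.9651317036, 0.5361659664, 2.9941829776, 91.0),
    "GBR": (0.4283970541, 8.3453972637, 0.6545204765, 2.8888401242, 91.65),
    "LR": (1.8259554442, 15.6377671095, 1.3512791881, 3.9544616712, 84.363),
    # Decimal point assumed for the RMAE entry of SGD.
    "SGD": (1.3773921547, 20.96618639, 1.1736235149, 4.5788848419, 79.0),
}


def check_row(mae, mse, rmae, rmspe, accuracy):
    """Return the three absolute discrepancies for one table row."""
    return (
        abs(math.sqrt(mae) - rmae),    # RMAE column vs sqrt(MAE)
        abs(math.sqrt(mse) - rmspe),   # RMSPE column vs sqrt(MSE)
        abs((100.0 - mse) - accuracy), # quoted accuracy vs 100 - MSE
    )


discrepancies = {name: check_row(*row) for name, row in table4.items()}
for name, (d_rmae, d_rmspe, d_acc) in discrepancies.items():
    print(f"{name}: {d_rmae:.2e} {d_rmspe:.2e} {d_acc:.3f}")
```

The small residual differences in the accuracy column are consistent with the ratios having been computed from rounded MSE values.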
Figure 4. RMAE for different regression models evaluation for the IMOS dataset

Figure 5. RMSPE for different regression models evaluation for the IMOS dataset

Figure 6. Five machine learning regressions with mean square errors show dust test accuracy


4. CONCLUSION 

Machine learning regression methods tackle many real-world problems due to their properties.
Regression algorithms anticipate and forecast well, and this capability benefits humanity if the findings are
realistic and accurate. This study focused on dust forecasting based on the measured performance error of five
regression algorithms. To achieve optimal results, the quantity and quality of dust-related weather data must be
considered. Many human and environmental factors affect the physical processes of dust-related weather,
requiring extensive research. This research involves choosing the best methods for evaluating historical weather
data and discovering changes in their patterns over time, considering contemporary methods for renewing these
data, choosing more regression algorithm techniques, studying their mathematical basis in the analysis, and
compiling them into a single model before testing their accuracy on real-world meteorological data. The
regression models are effective, simple, and user-friendly, and have a transparent interface window. These models
can be upgraded for all Iraqi central government agencies that address environmental, agricultural, and industrial
challenges, and specific provinces or stations can be targeted by customizing the interface pane. In future work,
dust prediction and monitoring will be studied using deep learning and machine learning. The SVM, naive Bayes,
logistic regression, and LSTM will be suggested to improve deep learning models. These techniques will be
compared to the hybrid learning strategy in a supervised machine learning regression framework.